Introduction

A three-choice odour-trap assay was conducted to test the attraction of mature, mated and young, virgin female flies to orange-agar media infected with Hanseniaspora uvarum, Pichia kluyveri or a sterile control.

Fly traps were assembled from 70 mL screw-capped sterile pots (TechnoPlas, South Australia) containing 15 mL of orange-agar (preservative free orange juice, 15 g/L agar). A 1.5 mL Eppendorf tube, cut 5 mm from the narrow end, was slotted through a puncture hole in the lid. The lid surrounding the trap entry was sealed with Parafilm to prevent odours escaping.

knitr::include_graphics('img/trap.jpg')


Fly traps were inoculated with 250 µL of a high cell density (OD600 ≈ 9) suspension of the treatment yeasts Hanseniaspora uvarum (HU) or Pichia kluyveri (PK) (grown on SDA media), or with SDA media only for the control. Traps were incubated overnight at 25 °C, then placed in mesh cages (30 cm × 30 cm × 30 cm) in a triangular formation with equal distance (18 cm) between each treatment.

knitr::include_graphics('img/cage.jpg')

Thirty female flies were released into each cage at 9:30 AM and left for 8 h, with trap positions rotated hourly to control for positional bias. Trap catches were counted at the end of the 8 h period. Trials with mated and virgin females were run at the same time but in separate cages.

All flies were starved of sugar and protein for 12 h before commencing the experiment. Twenty replicates were carried out for each treatment, for mated and virgin female flies (5 replicates of mated and virgin flies run in parallel over 4 days).

Previous statistical analyses

Previously, to account for the different numbers of ‘choosers’ in each cage, trap catches were analysed as proportions of flies making a choice. As the data did not conform to a normal distribution (Kolmogorov–Smirnov test) following arcsine transformation, we used the non-parametric Kruskal–Wallis test (KW), with Dunn’s test for post hoc analysis (with Šidák correction on the alpha value).

Trap catches and egg counts from the two 3-choice trials were analysed as proportional data using the same non-parametric approach. The difference between virgin and mated fly catches in traps was analysed with a Mann–Whitney U test. Virgin flies responded very poorly to all treatments, with traps catching significantly fewer flies than in the mated experiments (mean trap catch for virgins = 9 ± 2%).
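For reference, that earlier non-parametric workflow can be sketched in R on simulated proportions. The group means below are invented purely for illustration; the real analysis used the per-cage trap proportions.

```r
set.seed(42)
# three invented groups of 20 proportions each
prop <- c(rbeta(20, 2, 8), rbeta(20, 5, 5), rbeta(20, 6, 4))
treatment <- factor(rep(c("control", "HU", "PK"), each = 20))

# Kruskal-Wallis test (base R)
kw <- kruskal.test(prop ~ treatment)
kw$p.value

# Post hoc Dunn's test with Sidak correction, e.g. via the FSA package:
# FSA::dunnTest(prop ~ treatment, method = "sidak")
```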

GLMs and GLMMs

This time, I wished to reanalyse the data within the framework of generalized linear modelling, incorporating fixed and random effects.

In broad terms, fixed effects are variables that we expect will have an effect on the dependent/response variable: they’re what you call explanatory variables in a standard linear regression. In our case, we are interested in making conclusions about how treatment impacts insect choice. So treatment is a fixed effect and choice is the dependent variable.

On the other hand, random effects are usually grouping factors for which we are trying to control. They are always categorical. A lot of the time we are not specifically interested in their impact on the response variable, but we know that they might be influencing the patterns we see.

You generally want your random effect to have at least five levels, because estimating a variance from few data points is very imprecise. If you only have two or three levels, the model will struggle to partition the variance: it will give you an output, but not necessarily one you can trust.

So, for instance, if we want to control for the effect of mating status on choice, we have to fit mating status (a two-level factor: mated or virgin) as a fixed, not random, effect.
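As a sketch of how this plays out in lme4-style model formulas (`value`, `treatment`, `mated` and `cage` are the columns used later in this post):

```r
# Mating status (two levels) belongs on the fixed-effect side;
# cage, with many levels, is a candidate for a random intercept.
good_spec <- value ~ treatment + mated + (1 | cage)

# Not this: two levels are far too few to estimate a variance for.
bad_spec <- value ~ treatment + (1 | mated)
```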

#Set required packages
.cran_packages <- c("tidyverse",
                    "lme4",
                    "MASS",
                    "emmeans",
                    "patchwork",
                    "performance",
                    "see",
                    "parameters",
                    "glmmTMB")

.inst <- .cran_packages %in% rownames(installed.packages())
if(any(!.inst)) {
   install.packages(.cran_packages[!.inst])
}
#Load all packages
sapply(.cran_packages, require, character.only = TRUE)
##   tidyverse        lme4        MASS     emmeans   patchwork performance 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
##         see  parameters     glmmTMB 
##        TRUE        TRUE        TRUE
# install and load github packages
devtools::install_github("easystats/performance")
devtools::install_github("easystats/report")
library(performance)
library(report)

Read in dataset

choice <- read_csv("qfly_yeast_choice.csv")
print(choice)
## # A tibble: 40 x 4
##    cage  control    HU    PK
##    <chr>   <dbl> <dbl> <dbl>
##  1 V1          1     1     0
##  2 V2          3     0     0
##  3 V3          3     0     0
##  4 V4          3     0     0
##  5 V5          5     2     0
##  6 V6          0     0     0
##  7 V7          0     0     0
##  8 V8          4     0     4
##  9 V9          1     0     0
## 10 V10         1     0     0
## # … with 30 more rows

Turn data into long format & add new columns

choice_long <- choice %>%
  pivot_longer(cols = 2:4,
               names_to = "treatment",
               values_to = "value"
               ) %>%
  mutate(mated = case_when(
    str_detect(cage, "V") ~ "Virgin",
    str_detect(cage, "M") ~ "Mated"
  )) %>% 
  group_by(cage) %>%
  mutate(total= sum(value)) %>% # add total choosers
  ungroup()

print(choice_long)
## # A tibble: 120 x 5
##    cage  treatment value mated  total
##    <chr> <chr>     <dbl> <chr>  <dbl>
##  1 V1    control       1 Virgin     2
##  2 V1    HU            1 Virgin     2
##  3 V1    PK            0 Virgin     2
##  4 V2    control       3 Virgin     3
##  5 V2    HU            0 Virgin     3
##  6 V2    PK            0 Virgin     3
##  7 V3    control       3 Virgin     3
##  8 V3    HU            0 Virgin     3
##  9 V3    PK            0 Virgin     3
## 10 V4    control       3 Virgin     3
## # … with 110 more rows
# Look at the number that made a choice
choice_long %>%
  ggplot(aes(x=mated, y=total, fill=mated)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(fill=mated), colour="black", shape=21) +
  ylab("Total flies making choice") + 
  xlab("Mating status")

Check distribution

## Have a look at the data distribution:
ggplot(choice_long, aes(x=value, fill=treatment)) +
  geom_histogram()

## faceted
ggplot(choice_long, aes(x=value, fill=treatment)) +
  geom_histogram() + 
  facet_grid(treatment~mated)

This looks like a Poisson distribution to me, which is common for count data.

Fit a Poisson regression

One way to analyse these data would be to fit a generalized linear model to all of the data, ignoring cage effects for now.

pois_glm <- glm(value ~ treatment, family = poisson, data = choice_long)

#Get model parameters
parameters(pois_glm)
## Parameter      | Coefficient |   SE |         95% CI |     z |  df |      p
## ---------------------------------------------------------------------------
## (Intercept)    |        1.31 | 0.08 | [ 1.14,  1.47] | 15.92 | 117 | < .001
## treatment [HU] |       -1.70 | 0.21 | [-2.13, -1.31] | -8.13 | 117 | < .001
## treatment [PK] |        0.47 | 0.10 | [ 0.27,  0.68] |  4.49 | 117 | < .001
# performance checks using performance package
check_model(pois_glm)
## Not enough model terms in the conditional part of the model to check for multicollinearity.

model_performance(pois_glm)
## # Indices of model performance
## 
##    AIC |    BIC | R2_Nagelkerke | RMSE | SCORE_LOG | SCORE_SPHERICAL
## --------------------------------------------------------------------
## 783.43 | 791.79 |          0.80 | 2.12 |     -3.24 |            0.06
#Finescale checks
check_overdispersion(pois_glm)
## # Overdispersion test
## 
##        dispersion ratio =   4.920
##   Pearson's Chi-Squared = 575.601
##                 p-value = < 0.001
check_homogeneity(pois_glm)
## Warning: Variances differ between groups (Bartlett Test, p = 0.000).
check_normality(pois_glm)
## Warning: Non-normality of residuals detected (p = 0.000).
check_outliers(pois_glm)
## OK: No outliers detected.

Add mating status as fixed effect

mated_glm <- glm(value~treatment + mated, family = poisson, data = choice_long)

#Get model parameters
parameters(mated_glm)
## Parameter      | Coefficient |   SE |         95% CI |      z |  df |      p
## ----------------------------------------------------------------------------
## (Intercept)    |        1.86 | 0.08 | [ 1.69,  2.02] |  21.96 | 116 | < .001
## treatment [HU] |       -1.70 | 0.21 | [-2.13, -1.31] |  -8.13 | 116 | < .001
## treatment [PK] |        0.47 | 0.10 | [ 0.27,  0.68] |   4.49 | 116 | < .001
## mated [Virgin] |       -1.85 | 0.14 | [-2.14, -1.58] | -12.87 | 116 | < .001
# performance checks using performance package
check_model(mated_glm)

model_performance(mated_glm)
## # Indices of model performance
## 
##    AIC |    BIC | R2_Nagelkerke | RMSE | SCORE_LOG | SCORE_SPHERICAL
## --------------------------------------------------------------------
## 541.81 | 552.96 |          0.98 | 1.57 |     -2.22 |            0.06
#Finescale checks
check_overdispersion(mated_glm)
## # Overdispersion test
## 
##        dispersion ratio =   2.632
##   Pearson's Chi-Squared = 305.255
##                 p-value = < 0.001
check_homogeneity(mated_glm)
## Warning: Variances differ between groups (Bartlett Test, p = 0.000).
check_normality(mated_glm)
## Warning: Non-normality of residuals detected (p = 0.004).
check_outliers(mated_glm)
## OK: No outliers detected.

The Poisson distribution assumes that the mean and variance are equal. The checks above show a dispersion ratio well above 1: the data show extra variation beyond the mean, i.e. they are ‘overdispersed’.

An alternative to the Poisson distribution is the negative binomial, which is more flexible because it estimates the variance independently of the mean.
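The dispersion check reported by `check_overdispersion()` can be hand-rolled: the ratio of the Pearson chi-squared statistic to the residual degrees of freedom should be near 1 for a well-specified Poisson model. A small simulated illustration (overdispersed counts fitted with a Poisson GLM, base R only):

```r
set.seed(1)
# simulate overdispersed counts (negative binomial), then fit a Poisson GLM
y <- rnbinom(120, mu = 4, size = 1)
x <- factor(rep(c("A", "B", "C"), each = 40))
fit <- glm(y ~ x, family = poisson)

pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
dispersion_ratio <- pearson_chisq / df.residual(fit)
dispersion_ratio  # well above 1 here, flagging overdispersion
```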

check_distribution(pois_glm)
## # Distribution of Model Family
## 
## Predicted Distribution of Residuals
## 
##       Distribution Probability
##             normal         38%
##  negative binomial         28%
##      beta-binomial         12%
## 
## Predicted Distribution of Response
## 
##                Distribution Probability
##               beta-binomial         59%
##  neg. binomial (zero-infl.)         31%
##                         chi          3%
plot(check_distribution(pois_glm))

Fit a negative binomial regression

nb_glm <- glm.nb(value~treatment + mated, data = choice_long)

#Get model parameters
parameters(nb_glm)
## Parameter      | Coefficient |   SE |         95% CI |     z |  df |      p
## ---------------------------------------------------------------------------
## (Intercept)    |        2.01 | 0.16 | [ 1.67,  2.36] | 12.23 | 116 | < .001
## treatment [HU] |       -1.87 | 0.28 | [-2.44, -1.32] | -6.61 | 116 | < .001
## treatment [PK] |        0.20 | 0.21 | [-0.22,  0.61] |  0.94 | 116 | 0.348 
## mated [Virgin] |       -1.83 | 0.21 | [-2.25, -1.42] | -8.84 | 116 | < .001
#check model performance
check_model(nb_glm)

model_performance(nb_glm)
## # Indices of model performance
## 
##    AIC |    BIC | R2_Nagelkerke | RMSE | SCORE_LOG | SCORE_SPHERICAL
## --------------------------------------------------------------------
## 463.06 | 476.99 |          0.82 | 1.02 |     -1.90 |            0.07
#Finescale checks
check_homogeneity(nb_glm)
## Warning: Variances differ between groups (Bartlett Test, p = 0.000).
check_normality(nb_glm)
## Warning: Non-normality of residuals detected (p = 0.001).

Fit a beta-binomial regression on proportions

Alternatively, modelling the counts as a proportion of the flies that made a choice may be better. The negative binomial distribution is not appropriate for proportion data, or for count data with an upper bound. Binomial (or beta-binomial, if you want to allow for additional variation) models are appropriate for proportions expressed as counts.

bb_glm <- glmmTMB(cbind(value, total-value) ~ treatment + mated, data = choice_long, family=betabinomial)

#Get model parameters 
parameters(bb_glm)
## Parameter      | Coefficient |   SE |         95% CI |     z |  df |      p
## ---------------------------------------------------------------------------
## (Intercept)    |       -0.23 | 0.22 | [-0.65,  0.19] | -1.07 | 115 | 0.283 
## treatment [HU] |       -1.95 | 0.33 | [-2.60, -1.29] | -5.83 | 115 | < .001
## treatment [PK] |        0.20 | 0.29 | [-0.37,  0.76] |  0.69 | 115 | 0.489 
## mated [Virgin] |       -0.08 | 0.27 | [-0.62,  0.46] | -0.30 | 115 | 0.767
model_performance(bb_glm)
## Can't calculate log-loss.
## Can't calculate proper scoring rules for models without integer response values.
## # Indices of model performance
## 
##    AIC |    BIC | RMSE | LOGLOSS
## --------------------------------
## 410.69 | 424.63 | 0.29 |

What about independence between cages?

However, what about observation independence? We collected data in different cages. While these flies are all from the same laboratory population, the overall number of ‘choosers’ differed between cages. I believe this was behavioural, i.e. once one fly had made a choice, others followed. Alternatively, there could have been fighting between females.

knitr::include_graphics('img/cages.jpg')

# Plot separated by cage
choice_long %>% 
  mutate(cage = str_sub(cage, 2, 3)) %>%
  ggplot(aes(x = treatment, y = value, fill = treatment)) +
  geom_point(colour = "black", shape = 21, size = 2) +
  facet_grid(mated ~ cage) +
  xlab("Treatment") + ylab("Flies caught")

Mixed effect modelling

A mixed model might be an option here: it allows us to use all the data we have (higher sample size) and to account for correlations between data coming from the same cage. We also estimate fewer parameters and avoid the multiple-comparison problems we would encounter with separate regressions.

mixed_glmer <- glmer.nb(value ~ treatment + mated + (1|cage), data = choice_long)

#Get model parameters
parameters(mixed_glmer)
## Parameter      | Coefficient |   SE |         95% CI |     z |  df |      p
## ---------------------------------------------------------------------------
## (Intercept)    |        2.01 | 0.18 | [ 1.66,  2.35] | 11.27 | 114 | < .001
## treatment [HU] |       -1.87 | 0.29 | [-2.43, -1.30] | -6.50 | 114 | < .001
## treatment [PK] |        0.20 | 0.21 | [-0.22,  0.61] |  0.92 | 114 | 0.356 
## mated [Virgin] |       -1.83 | 0.21 | [-2.24, -1.42] | -8.72 | 114 | < .001
#check model performance
check_model(mixed_glmer)

model_performance(mixed_glmer)
## # Indices of model performance
## 
##    AIC |    BIC | R2_conditional | R2_marginal | ICC | RMSE | SCORE_LOG | SCORE_SPHERICAL
## -----------------------------------------------------------------------------------------
## 465.06 | 481.78 |           0.71 |        0.71 |   0 | 1.02 |     -1.90 |            0.07
#Finescale checks
check_homogeneity(mixed_glmer)
## Warning: Variances differ between groups (Bartlett Test, p = 0.000).

The cage variance is singular (note ICC = 0 above), indicating that any between-cage variance is already captured by mating status and the residual variance, and is therefore not worth including as a random effect. The lesson here is that although there is “obviously” variation in cage responders, the extent of this variation can be fully, or virtually fully, explained without a cage term. There is not enough additional cage-level variation to warrant adding a cage-level random effect.
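A singular fit like this can be confirmed directly with `check_singularity()` from the performance package loaded above. As a self-contained illustration (assuming lme4 is installed), here is a simulated dataset where the grouping factor carries no real variance, so the random-intercept variance tends to collapse to ~0:

```r
library(lme4)

set.seed(7)
# pure noise: the grouping factor g explains nothing
d <- data.frame(y = rnorm(60), g = factor(rep(1:10, each = 6)))
m <- lmer(y ~ 1 + (1 | g), data = d)

# returns TRUE when a random-effect variance has collapsed to ~0
performance::check_singularity(m)
```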

Compare model performance

compare_performance(pois_glm, mated_glm, nb_glm, bb_glm, mixed_glmer, rank = TRUE)
## # Comparison of Model Performance Indices
## 
## Model       |     Type |    AIC |    BIC | RMSE |          BF | Performance_Score
## ---------------------------------------------------------------------------------
## bb_glm      |  glmmTMB | 410.69 | 424.63 | 0.29 | 5.33572e+79 |           100.00%
## nb_glm      |   negbin | 463.06 | 476.99 | 1.02 | 2.27629e+68 |            57.91%
## mixed_glmer | glmerMod | 465.06 | 481.78 | 1.02 | 2.07796e+67 |            57.45%
## mated_glm   |      glm | 541.81 | 552.96 | 1.57 | 7.27517e+51 |            39.98%
## pois_glm    |      glm | 783.43 | 791.79 | 2.12 |        1.00 |             0.00%
## 
## Model bb_glm (of class glmmTMB) performed best with an overall performance score of 100.00%.

Final model hypothesis testing

# refit my final model
final_model <- glmmTMB(cbind(value, total-value) ~ treatment + mated, data = choice_long, family=betabinomial)

#Get model parameters
parameters(final_model)
## Parameter      | Coefficient |   SE |         95% CI |     z |  df |      p
## ---------------------------------------------------------------------------
## (Intercept)    |       -0.23 | 0.22 | [-0.65,  0.19] | -1.07 | 115 | 0.283 
## treatment [HU] |       -1.95 | 0.33 | [-2.60, -1.29] | -5.83 | 115 | < .001
## treatment [PK] |        0.20 | 0.29 | [-0.37,  0.76] |  0.69 | 115 | 0.489 
## mated [Virgin] |       -0.08 | 0.27 | [-0.62,  0.46] | -0.30 | 115 | 0.767
## alternatively, filter out virgins and fit final model

choice_filtered <- choice_long %>%
  filter(mated == "Mated")

filtered_final <- glmmTMB(cbind(value, total-value) ~ treatment , data = choice_filtered, family=betabinomial)

#Get model parameters
parameters(filtered_final)
## Parameter      | Coefficient |   SE |         95% CI |     z | df |      p
## --------------------------------------------------------------------------
## (Intercept)    |       -0.74 | 0.23 | [-1.19, -0.30] | -3.25 | 56 | 0.001 
## treatment [HU] |       -1.39 | 0.37 | [-2.12, -0.66] | -3.75 | 56 | < .001
## treatment [PK] |        1.12 | 0.32 | [ 0.50,  1.74] |  3.52 | 56 | < .001

Reporting

The report package’s primary goal is to bridge the gap between R’s output and the formatted results contained in your manuscript. It automatically produces reports of models and dataframes according to best practice guidelines (e.g., APA’s style guide), ensuring standardization and quality in results reporting.

This is not currently implemented for glmmTMB models, so as an example I will show it for the negative binomial model.

library(report)
report(nb_glm)
## We fitted a general linear model (Negative Binomial(1.8933) family with a log link) (estimated using ML) to predict value with treatment and mated (formula = value ~ treatment + mated). Standardized parameters were obtained by fitting the model on a standardized version of the dataset. Effect sizes were labelled following Funder's (2019) recommendations.The model's explanatory power is substantial (Nagelkerke's R2 = 0.82). The model's intercept, corresponding to value = 0, treatment =  and mated = , is at 2.01 (SE = 0.16, 95% CI [1.67, 2.36], p < .001). Within this model:
## 
##   - The effect of treatmentHU is negative and can be considered as very large and significant (beta = -1.87, SE = 0.28, 95% CI [-2.44, -1.32], std. beta = -1.87, p < .001).
##   - The effect of treatmentPK is positive and can be considered as very small and not significant (beta = 0.20, SE = 0.21, 95% CI [-0.22, 0.61], std. beta = 0.20, p = 0.348).
##   - The effect of matedVirgin is negative and can be considered as very large and significant (beta = -1.83, SE = 0.21, 95% CI [-2.25, -1.42], std. beta = -1.83, p < .001).